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1.  Introduction 

To  solve  many  practical  complex  optimization  problems  often  results  in  having  to  minimize  a 
nonsmooth  function  /.  The  graph  of  such  a  function  of  n  variables  appears  V-shaped  in  some 
directions  while  in  directions  orthogonal  to  these  the  graph  of  /’ s  closely  related  “U-Lagrangian” 
[9]  is  U-shaped.  To  be  efficient  a  minimization  algorithm  needs  to  approximate  the  correct  “VU- 
space  decomposition”  and  any  existing  second  order  information  on  the  U-subspace  via  some  kind 
of  corresponding  combination  of  polyhedral(V-shaped)  and  quadratic  (U-shaped)  approximation. 
For  a  convex  function  /  the  relevant  second  order  information  is  contained  in  the  “U-Hessian”,  i.e. 
the  Hessian  of  its  U-Lagrangian.  An  initial  version  of  a  VU-algorithm  that  does  this  for  convex 
functions  was  published  in  [20].  It  depends  on  approximating  points  on  “primal-dual  tracks”  [18] 
where  U-Hessians  exist.  Such  approximation  is  based  on  two  fundamental  results  depending  on 
proximal  point  and  bundle  method  theories  [6].  The  first  is  the  result  from  [20]  that  says  that  if 
there  exists  a  primal  track  leading  to  a  minimizer  x  at  which  a  nondegeneracy  asssumption  holds 
and  a  varying  proximal  point  parameter  p  =  p( x)  satisfies  p{x)\x  —  x\  — »  0  as  x  — »  x,  then  for  all  x 
sufficiently  close  to  x  the  proximal  point  of  x,  denoted  by  p^(x),  is  on  the  primal  track.  The  second 
result  from  [3]  implies  that  for  any  p  >  0  and  x  €  3?”  P/j.(x)  can  be  approximated  with  arbitrary 
precision  by  a  finite  sequence  of  bundle  algorithm  steps.  In  this  context  x  is  called  a  bundle  center. 
Moreover,  because  a  bundle  method  [6,  1]  employs  V-shaped  (or  cutting-plane)  appproximation, 
based  on  gathering  together  a  set  of  subgradients,  it  can  provide  local  VU-decomposition  basis 
matrices  and  “U-gradients”  (i.e.  gradients  of  U-Lagrangians)  without  having  to  know  explicitly 
the  underlying  structure  of  /. 

The  basic  VU-algorithm  alternates  V-steps  (or  corrector  steps)  with  U-steps  (or  predictor  steps) 
in  an  attempt  to  follow  an  unknown  primal  track.  A  U-step  is  a  Newton  step  (depending  on 
local  first  and  second  order  U-derivative  approximations)  from  a  primal  track  point  estimate  p 
(approximating  p^,(x))  to  the  next  bundle  center  x+J  a  point  that  could  be  a  relatively  poor 
estimate  of  the  next  primal  track  point.  A  next  V-step  is  the  final  substep  in  a  sequence  of  bundle 
algorithm  substeps.  It  gives  a  correction  step  from  x+  to  the  next  primal  track  point  estimate  p+. 
The  VU-basis  matrices  needed  to  form  these  steps  are  obtained  as  a  by-product  from  computing 
the  vector  with  smallest  Euclidean  norm  in  the  convex  hull  of  the  subgradients  active  in  the  bundle 
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subroutine  generation  of  p.  This  short  vector  is  denoted  by  s  and  is  an  estimate  of  a  dual  track 
point.  The  U-gradient  required  for  calculation  of  the  U-Newton  step  is  the  U-projection  of  s. 

To  prove  convergence,  even  when  a  primal  track  does  not  exist,  there  is  a  line  search  on  the  line 
segment  from  p  to  p+  whenever  f(p+)  is  not  sufficiently  smaller  than  f{p)  relative  to  a  multiple 
of  |s|2.  This  search  generates  a  replacement  for  x+  such  that  its  corresponding  p+,  coming  from 
a  second  bundle  subroutine  run,  does  satisfy  sufficient  /-decrease. 

The  line  search  recursively  builds  a  one-dimensional  V-model  of  /  and  stops  when  this  model 
is  as  accurate  as  the  n-dimensional  V-model  generated  by  the  bundle  run  that  produced  p.  It  also 
develops  a  single  variable  U-model  as  in  [8].  Information  gained  from  this  search  is  then  used  in 
deciding  a  new  (usually  smaller)  value  for  p  and  in  initializing  the  bundle  of  subgradients  for  the 
second  bundle  run  of  the  current  iteration. 

Also,  the  algorithm  includes  another  line  search  which  may  change  x+  before  the  first  bundle 
run  in  an  iteration.  This  extrapolation  line  search  along  the  U-step  direction  from  p  is  called  for 
when  a  Wolfe  directional  derivative  increase  test  [1]  is  not  satisfied  at  x+.  This  search  continues 
until  such  a  test  is  satisfied  and  then  x+  is  redefined  to  be  the  search  point  found  with  the  least 
/-value.  Data  from  this  search  goes  into  possibly  changing  the  value  of  p  and  into  initializing  the 
bundle  for  the  first  bundle  run.  Moreover,  at  the  first  iteration,  if  an  extrapolation  line  search  is 
not  called  for,  then  an  interpolation  line  search  is  called  in  order  to  obtain  data  for  choosing  an 
initial  value  p  =  p-[ ,  so  that  the  user  does  not  have  to  input  this  value.  Any  line  search  called  for 
by  the  algorithm  appends  up  to  two  new  subgradients  to  the  bundle. 

In  addition  to  the  above  features,  there  is  a  special  test  for  trying  to  detect  that  /  is  smooth  in 
the  sense  that  at  a  minimizer  its  U-space  dimension  equals  n.  When  this  /-descent  test  is  satisfied 
this  results  in  p+  being  set  to  x+,  i.e.  there  is  no  bundle  run  when  the  primal  track  is  estimated 
to  be  an  n-dimensional  ball  about  a  minimizer. 

For  various  proofs,  the  initial  version  of  the  algorithm  kept  p  <  p  where  p  is  a  fixed  upper 
bound  on  the  settings  of  p  for  all  bundle  runs.  The  current  version  does  this  for  all  needed  second 
bundle  runs,  but  allows  the  p- sequence  corresponding  to  first  bundle  runs  to  go  to  infinity.  Lemma 
17  in  [20],  dealing  with  superlinear  convergence,  gives  an  implementable  rate  at  which  p  should 
increase  when  a  primal-dual  track  is  approximated  sufficiently  well,  but  for  overall  efficiency  and 
convergence  the  algorithm  was  modified  to  include  heuristics  to  keep  p  from  getting  too  large  too 
soon  in  a  run.  A  new  proof  of  convergence  was  produced  to  deal  with  this  simple  modification.  The 
observed  numerical  advantage  for  allowing  p  to  increase  significantly  at  the  end  of  a  run  is  that 
the  bundle  subproblems  do  not  become  more  difficult  to  solve  as  the  iteration  number  increases. 
This  is  in  extreme  contrast  to  earlier  proximal  point  methods  [21]  that  had  the  parameter  go  to 
zero  to  obtain  superlinear  convergence  at  the  cost  of  having  the  subproblems  become  increasingly 
more  difficult  to  solve. 
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Some  numerical  reults  produced  by  the  above  described  algorithm  are  shown  in  Table  1  and 
discussed  next: 
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Table  1:  Summary  of  numerical  results 


The  bottom  three  rows  of  the  table’s  first  column  name  three  different  methods  applied  to  six 
test  functions  that  are  named  in  the  top  row  of  the  next  six  columns.  N1CV2  is  a  FORTRAN 
implementation  of  the  good  non-VU  proximal  bundle  method  of  Lemarechal  and  Sagastizabal 
[10].  The  code  N.VU  is  written  in  MATLAB,  except  for  its  FORTRAN  quadratic  progamming 
subroutine  taken  from  N1CV2.  The  leading  N.  refers  to  a  Newton  version,  meaning  that  when  it  is 
applied  to  a  finite  max  function  the  Hesssian  of  a  subfunction  that  is  active  at  a  point  is  provided  by 
the  function  evaluation  subroutine.  At  each  iteration  a  U-Hessian  estimate  is  computed  depending 
on  a  current  U-projectecl  convex  combination  of  subfunction  Hessians  as  described  in  [20].  The 
last  method,  qN.VU,  is  a  first  attempt  at  a  quasi-Newton  version  of  N.VU.  It  does  not  use  second 
derivative  information,  but  instead  uses  a  current  U-projected  BFGS-updated  n  by  n  matrix 
intended  to  approximate  the  Hessian  of  the  kind  of  Lagrangian  in  Theorem  7.2  of  [19]. 

In  the  table,  numbers  under  f/g  indicate  the  total  number  of  function  evaluations  for  a  run. 
Each  such  evaluation  also  includes  the  evaluation  of  one  subgradient  and,  in  the  case  of  the  Newton 
version,  includes  the  evaluation  of  one  subfunction  Hessian.  The  label  Ac  indicates  an  accuracy 
measure  that  is  the  number  of  correct  digits  in  the  best  /-value  found.  Except  for  3d-Ex  the  test 
functions  are  the  convex  finite  max  functions  from  [20] . 

The  convex  function  3d-Ex  is  given  by 

f(x i,x2,x3)  =  (l/2)x\  +  (1/2) ^ ((xj  -  2x2)2  +  (£3  -  x2)2). 

It  was  created  for  use  in  conference  presentations  and  in  future  publications  to  illustrate  graphically 
various  concepts,  such  as  a  primal  and  dual  tracks.  This  nonsmooth  function  is  the  maximum 
eigenvalue  of  a  symmetric  2  by  2  matrix  whose  three  distinct  entries  are  C2-functions  on  R3 .  As 
such,  this  infinite  max  function  is  in  the  very  broad  class  of  functions  with  primal-dual  gradient 
(pdg)  structure  first  defined  in  [17]  and  further  developed  in  [19].  In  [18]  such  functions  are  shown 
to  have  primal-dual  tracks  under  certain  conditions  including  “strong  transversality” .  Also,  3d- 
Ex  can  be  viewed  as  a  quadratic  function  plus  the  Euclidean  norm  of  a  2-dimensional  vector 
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function.  N.VU  was  run  on  this  function  using  its  Hessian  at  points  where  its  square  root  function 
is  evaluated  at  a  positive  number  and  using  the  Hessian  of  the  first  term,  (l/2)xf,  otherwise. 

The  good  performance  of  the  quasi-Newton  version  relative  to  that  of  N1CV2  as  revealed  in 
Table  1  is  especially  encouraging,  because  it  is  the  type  of  algorithm  which  is  needed  to  minimize 
implicitly  defined  functions  resulting  from  applying  decomposition,  relaxation  and/or  dualization 
techniques  to  large  or  complex  mathematical  programming  problems. 

For  the  test  functions  with  known  positive  definite  U-Hessians  the  VU-codes  have  generated 
p-sequences  with  observed  numerical  superlinear  convergence.  For  all  such  functions  they  have 
reported  terminal  V  and  U  subspace  dimensions  equal  to  those  of  the  ones  for  the  known  optimal 
solutions. 

2.  First  year  findings  resulting  from  AFOSR  support 

AFOSR  supported  research  was  first  devoted  to  working  on  theory  for  algorithm  convergence 
and  application.  There  are  several  assumptions  in  [20]  for  showing  superlinear  convergence  of  the 
p-sequence  to  a  unique  minimizer  x.  One  of  the  two  most  important  of  these  is  the  condition 
that  eventually  each  bundle  run  generates  local  V  and  U  subspace  basis  matrices  that  converge 
to  ones  for  the  corresponding  subspaces  at  x.  The  Principal  Investigator  consulted  with  the  Key 
Senior  Investigator,  Claudia  Sagastizabal,  on  her  paper  with  A.  Daniilidis  and  M.  Solodov  [4] 
which  shows  this  condition  holds  for  certain  convex  max-functions.  In  addition  to  assuming  that 
the  bundle  center  is  close  enough  to  x,  the  authors  assume  that  the  prox-parameter  is  sufficiently 
large,  a  condition  that  fits  in  well  the  above  discussion  on  not  having  an  upper  bound  on  the 
p-value  chosen  for  the  first  bundle  run  for  each  iteration. 

The  PI  corrected  their  initial  result  concerning  which  generated  vectors  are  basis  vectors  for 
the  V-subspace  and  also  suggested  a  bundle  subroutine  stopping  test  (based  on  being  similar 
to  one  in  [20])  which  turned  out  to  be  better  numerically  than  their  original  one.  In  addition, 
the  PI  improved  their  discussion  of  the  structure  of  the  Ll-norm  function  used  as  an  example. 
This  improvement,  detailed  next,  validates  the  importance  of  pdg-structure  mentioned  in  the 
introduction.  The  definition  of  pdg-structure  depends  on  several  C 2  structure  functions,  denoted 
by  fj  and  <pj,  where  the  latter  type  are  essential  for  expressing  the  structure  of  infinite  max- 
functions  such  as  maximum  eigenvalue  functions.  For  f(x)  =  y//.,  |x,|  at  a  point  x  where  m 
out  of  the  n  xfs  are  zero  there  are  two  extreme  ways  to  express  the  pdg-structure  for  /  near 
x.  One  way  uses  2m  structure  functions  fj  based  on  the  classical  result  that  f  can  be  expressed 
as  the  pointwise  maximum  of  2m  C 2  functions  .  The  other  way  developed  by  the  PI  uses  only 
1  +  m  structure  functions,  denoted  by  fo  and  < f>j  for  j  =  1,2,  ...,m.  An  important  condition  that 
implies  the  existence  of  a  primal  track  to  x  is  strong  transversality.  For  the  first  structure  this 
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condition  says  that  at  x  the  gradients  of  the  2m  /;’s  are  affinely  independent.  For  the  second 
structure  the  condition  reduces  to  the  m  V0j(x)’s  being  linearly  independent  which  is  the  case 
for  this  /,  since  these  gradients  are  distinct  unit  vectors.  If  2m  is  larger  than  n  +  1  then  the  first 
structure  does  not  satisfy  strong  transversality,  so  the  second  structure  is  to  be  preferred.  In  any 
case  the  second  structure  appears  to  be  a  more  natural  representation  for  the  local  structure  of 
the  Ll-norm  function.  This  is  another  illustration  of  the  importance  of  having  <pj- type  structure 
functions  for  describing  the  local  structure  of  nonsmooth  functions. 

Also,  during  the  first  year  of  AFOSR  grant  research  more  work  was  done  on  the  quasi-Newton 
version  of  the  algorithm.  The  original  version  restarted  a  BFGS  update  process  with  a  scalar 
multiple  of  the  identity  matrix  whenever  the  U-dimension  estimate  (which  starts  at  n)  attained 
a  new  least  value.  The  current  version  does  not  restart,  but  instead  starts  at  iteration  3  or 
higher  when  two  distinct  p-iterates  and  corresponding  s-iterates  generate  a  large  enough  curvature 
estimate  to  be  used  as  an  initialization  scalar  multiplier.  This  simpler  process  runs  as  well  as  the 
original  one  on  the  Table  1  test  functions  and  on  several  others.  This  change  has  the  added  benefit 
of  allowing  for  fewer  and  less  complicated  heuristics  for  updating  the  prox-parameter  in  situations 
where  there  is  some  indication  that  its  value  is  relatively  too  large,  which  can  occur  when  the 
VU-dimensions  are  not  yet  optimal. 

Another  promising  quasi-Newton  update  from  [13]  was  also  tried.  However,  in  our  reduced 
Hessian  context  it  often  produced  ill-conditioned  U-Hessian  estimates  which  led  to  poor  overall 
performance  in  comparison  to  the  BFGS  update.  Perhaps,  this  indicates  that  in  the  future  the 
DFP  update  should  be  tried,  because  it  is  one  in  the  Broyden  class  that  is,  in  some  sense,  furthest 
away  from  the  new  update  in  [13] . 

Additional  testing  on  some  randomly  generated  functions  that  are  pointwise  maxima  of  quadrat¬ 
ics  found  a  few  functions  that  were  much  more  difficult  to  minimize  than  most  in  terms  of  total 
number  of  function  evaluations.  An  idea  that  led  to  overall  improvement  depends  on  the  observa¬ 
tion  that  for  some  bundle  subproblem  runs  the  final  iterate  that  satisfies  the  stopping  test  has  an 
f-value  that  is  worse  than  that  of  one  of  the  previous  bundle  iterates  that  was  active  in  generating 
the  last  iterate.  So,  the  bundle  subroutine  output  proximal  point  estimate  p  was  changed  to  be 
the  bundle  iterate  with  the  best  f-value  from  among  those  of  either  the  last  iterate  or  the  iterates 
active  in  generating  the  last  one.  This  change  could  be  even  more  important  for  extending  the 
algorithm  to  converge  to  stationary  points  for  nonconvex  functions,  a  subject  of  ongoing  research. 

3.  Second  year  research  findings 

During  the  second  year  of  research  more  progress  was  made  on  improving  the  performance  of  the 
BFGS  quasi-Newton  version  of  our  VU-algorithm  for  the  convex  case.  As  stated  in  the  introduction 
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x+,  equal  to  p  plus  its  U-Newton  step,  may  not  be  a  “good”  bundle  center  in  the  sense  that  its 
bundle  run  generated  prox  point  estimate  p+  has  such  a  large  /-value  that  it  causes  both  a  line 
search  and  an  additional  bundle  run  to  be  executed,  in  order  to  obtain  good  replacements  for  the 
values  of  x+  and  p+.  In  order  to  avoid  some  of  these  undesirable  iterations  having  two  bundle 
runs  there  is  now  a  heuristic  test  based  on  the  /-value  at  the  initial  value  of  x+.  One  possible 
outcome  of  the  test  is  to  declare  a  “bad”  value  of  x+  and  call  for  a  preemptive  line  search  from 
p  along  the  U-Newton  direction  to  find  a  replacement  point  that  has  a  very  good  /-value,  i.e. 
one  that  does  not  exceed  f(p).  This  action  is  taken  when  /( x+)  >  max{f(x),f(x_)}  where  x 
is  the  bundle  center  corresponding  the  current  p  and  X-  is  the  one  from  the  previous  iteration. 
Not  including  f(x-)  in  this  heuristic  test  led  to  poor  performance  on  some  test  functions  that 
either  did  not  have  a  primal  track  to  a  minimizer  or  did  not  have  a  positive  definite  U-Hessian 
at  a  minimizer.  The  current  code  actually  has  a  third  heuristic  number  to  test  against.  Research 
on  this  important  feature  will  be  revisited  when  a  future  version  of  the  algorithm  is  developed  for 
nonconvex  functions. 

The  quasi-Newton  version  of  the  algorithm,  with  the  test  for  f(x+)  being  too  large  and  with 
the  other  features  mentioned  above,  gave  very  good  numerical  results  on  several  test  functions. 
These  included  slight  improvements  for  the  functions  considered  in  Table  1.  For  example,  the 
new  numbers  for  the  3d-EX  column  are  35  14  and  for  the  3d-U0  column  are  30  15.  Also,  this 
version  provided  observed  superlinear  convergence  as  illustrated  below  in  Figure  1  where  it  is 
called  VU-BFGS.  This  figure  shows  the  results  of  running  two  BFGS-update  based  methods  on 
an  8  variable  nonsmooth  example  function  from  Lewis  and  Overton  [11].  The  function’s  minimum 
value  is  zero.  The  graph  shows  function  value,  on  a  log  base  10  scale,  depending  on  function 
evaluation  number,  i.e.  the  first  blue  and  red  marks  correspond  to  a  value  around  660,  the  249t/l 
red  dot’s  value  is  about  10-15  while  the  87th  blue  mark’s  value  is  near  10-16.  The  red  dots  were 
produced  by  the  smooth  BFGS  algorithm  with  line  searches  from  [11]  and  they  represent  observed 
R- linear  convergence.  The  blue  plus  symbols  come  from  VU-BFGS  and  they  illustrate  observed 
R.-superlinear  convergence. 

We  are  now  in  the  position  to  improve  the  numerical  behavior  of  the  algorithm  and  future 
versions  of  it,  even  for  the  nonconvex  case.  This  is  because  we  have  gotten  permission  to  use 
Kiwiel’s  best  computer  code  for  solving  the  sequence  of  quadratic  programming  (QP)  subproblems 
called  for  in  each  bundle  run.  This  code  is  a  very  accurate  one  based  on  a  specialized  QP  method 
in  [7].  It  also  has  the  facility  to  execute  a  “warm  start”  that  uses  an  ending  matrix  factorization 
for  one  QP  to  start  the  next  QP  solution  process. 

Also,  during  the  second  year  of  research  we  made  two  preliminary  advances  for  getting  a  handle 
on  the  noncovex  case.  One  is  the  improvement  of  a  code  [16]  for  local  minimization  of  a  non¬ 
convex  function  of  a  single  variable  which  can  be  used  for  executing  line  searches  called  for  by 
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Figure  1.  Comparison  of  Smooth  and  VU  Algorithms 

a  multivariable  algorithm  and  for  generating  useful  negative  curvature  information  when  such  is 
encountered.  The  code  is  based  on  the  method  in  [15]  which  combines  linear  and/or  quadratic 
approximation  to  obtain  superlinear  convergence  for  piecewise  C 2  functions.  The  improvement 
comes  from  more  use  of  quadratic  approximation  in  order  to  increase  the  likelihood  of  an  iterate 
being  on  the  opposite  side  of  a  local  minimizer  from  that  of  the  previous  iterate.  This  means  that 
curvature  information  on  either  side  of  a  local  minimizer  is  very  likely  to  be  updated  every  second 
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iteration  rather  than  having  a  sequence  of  two  or  more  consecutive  iterations  where  there  is  no 
update  on  a  given  side. 

The  other  significant  advance  comes  from  the  Pi’s  consultation  with  Claudia  Sagastizabal  on 
her  non-VU  composite  bundle  method  [23]  for  solving  certain  nonconvex  composite  problems  of 
the  form 


min  (h  o  c)(a:) 

where  c  :  3?"  — >  is  a  smooth  mapping  and  h  :  — >  3?  is  a  positively  homogeneous  convex 

function.  Such  a  structure  separates  the  two  difficulties  of  nonconvexity  and  nondifferentiability 
by  allowing  only  the  component  functions  of  c  to  be  nonconvex  and  only  h  to  be  nonsmooth.  This 
separation  along  with  positive  homogeneity  of  h  allows  the  algorithm  to  only  store  and  update  a 
bundle  of  subgradients  of  the  outer  function  h  rather  than  ones  from  the  objective  /  =  hoc.  In  [23] 
this  implementable  algorithm  is  shown  to  have  accumulation  points  that  are  Clarke  stationary  for 
locally  Lipschitz  functions.  It  is  a  specialized  version  of  a  conceptual  method  introduced  by  Lewis 
and  Wright  [12]  for  much  more  general  outer  functions  h.  These  methods  employ  certain  composite 
proximal  linearized  subproblems  each  of  which  involves  replacing  the  inner  smooth  mapping  c  by 
an  affine  approximation.  Because  of  this  linearization,  the  new  composite  bundle  algorithm  has 
a  type  of  step  not  present  in  previous  proximal  bundle  methods  that  only  have  serious  and  null 
steps.  This  third  type  of  step  is  called  a  backtrack  step.  It  is  like  a  null  step  in  that  it  does  not 
change  the  bundle  center,  but  unlike  one,  because  it  requires  the  bundle  parameter  to  be  increased 
strictly.  This  leads  to  the  interpretation  of  it  being  a  step  that  decreases  the  size  of  an  implicit 
trust  region  about  the  current  bundle  center. 

A  possible  first  step  in  a  forecast  future  progression  of  steps  to  learn  how  to  deal  with  non¬ 
convexity  within  a  VU-algorithm  would  be  to  determine  how  to  use  information  in  the  composite 
bundle  algorithm  to  make  V-space  approximations  and  where  to  add  moves  from  corresponding 
U-gradient  approximation. 

4.  Final  year  research  findings 

During  this  period  a  major  revision  of  Sagastizabal’s  composite  bundle  method  paper  [23] 
was  made  and  submitted  to  Mathematical  Programming.  It  now  contains  a  large  number  of 
computational  runs  on  an  extensive  variety  of  convex  and  nonconvex  test  functions.  This  paper’s 
algorithm  is  compared  with  three  other  methods,  including  the  smooth  BFGS  algorithm  with  a 
Wolfe  line  search  considered  in  Figure  1,  and  the  HANSO  package  hybrid  method  downloadable 
from  http://cs.nyu.edu/overton/software/index.html.  This  hybrid  is  the  BFGS  algorithm 
followed  by  the  Gradient  Sampling  algorithm  [2],  if  the  BFGS  terminal  point  is  not  satisfactory. 


9 


Even  though  BFGS  does  not  have  a  convergence  proof  for  nonsmooth  functions  and  it  has  been 
observed  to  fail,  it  does  have  adequate  accuracy  and  relatively  fast  convergence  for  some  functions. 
The  HANSO  method  converges  in  a  probabilistic  sense  to  Clarke  critical  points,  because  Gradient 
Sampling  has  this  property.  But  this  hybrid  method  is  the  most  expensive  one  in  the  foursome  in 
terms  of  total  number  of  functions  evaluations. 

Overall,  in  a  certain  average  sense  including  the  performance  measures  of  solution  accuracy, 
computer  time  used  and  number  of  function  evaluations,  Composite  Bundle  was  judged  to  be  the 
best  performer  among  the  four  solvers  tested.  Of  course,  a  different  conclusion  could  result  from 
running  on  a  different  set  of  test  functions. 

Composite  Bundle  did  have  some  difficulties  with  certain  function  and  starting  point  instances. 
There  were  stalls  resulting  from  consecutive  iterations  with  too  short  function- value-reducing 
serious  steps  or  with  null  and/or  back  track  steps  which  do  not  improve  the  objective  value.  Such 
behavior  results  in  the  plateaus  seen  in  Figure  2  below,  which  is  taken  from  [24]  and  is  discussd 
next. 


Figure  2.  Function  value  sequences  for  maxquad 

Figure  2  deals  with  the  nonsmooth  convex  test-function  called  maxquad  [1,  p.  153],  which  also 
is  considered  above  in  the  last  column  of  Table  1,  but  with  a  different  starting  point.  It  is  the 
pointwise  maximum  of  five  convex  quadratic  functions  on  SR10.  At  its  unique  optimal  solution, 
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four  of  the  quadratic  functions  are  active,  but  the  fifth  (inactive)  one  is  stiff,  making  maxquad  an 
interesting  academic  example  for  testing  different  algorithms.  It  is  more  difficult  to  minimize  than 
the  function  considered  in  Figure  1. 

For  four  different  algorithms  Figure  2  shows  function  value  accuracy,  relative  to  the  smallest 
value  ever  found.  The  horizontal  axis  gives  number  of  function  evaluations  referred  to  there 
as  black  box  calls.  In  addition  to  Composite  Bundle  (blue  circles)  and  smooth  BFGS  (black 
plus  symbols)  this  algorithm  group  contains  the  Proximal  Bundle[10]  code  N1CV2  (red  squares), 
considered  in  Table  1,  and  VU-bundle  (quasi-Newton)  (green  x’s),  called  VU-BFGS  in  Figure  1. 
The  VU-algorithm  can  be  thought  of  as  a  significant  improvement  of  both  the  Proximal  Bundle 
and  BFGS  methods,  since  it  combines  bundled  subgradients  of  the  former,  to  estimate  V  and  U 
subspaces,  together  with  use  of  the  BFGS  Hessian  update  formula  of  the  latter,  but  restricted  to 
the  U-subspace  where  Hessian  approximation  makes  sense.  A  possible  follow-up  research  topic 
would  to  be  to  make  a  similar  improvement  of  Composite  Bundle  by  adding  appropriate  U-steps  to 
its  serious  steps.  This  could  be  an  important  development  in  nonconvex  minimization,  but  only  in 
the  special  case  of  known  composite  structure  where  the  outer  function  is  positively  homogeneous 
and  convex.  This  could  provide  insight  into  the  noncovex  case  before  attacking  much  more  general 
nonconvex  functions  where  the  evaluation  black  box  only  provides  one  generalized  gradient  value 
with  each  function  evaluation. 

However,  we  speculated  that  the  key  idea  of  a  backtrack  step  from  the  Composite  Bundle 
method  could  provide  enough  insight  to  jump  to  the  quite  general  case  of  locally  Lipschitz  func¬ 
tions  that  are  semismooth  [14].  We  thought  it  could  be  the  needed  ingredient  to  make  a  successful 
VU-modification  of  Gupta’s  [5]  provably  convergent  algorithm  that  bundles  Hessians  together  with 
points  and  their  associated  generalized  gradients  (simply  called  gradients  from  here  on) .  The  bun¬ 
dled  Hessians  that  give  negative  curvature  information  can  be  used  to  construct,  V-models  with 
strictly  concave  “sides” .  This  can  be  done  algebraicly  by  modifying  bundle  subproblem  lineariza¬ 
tion  errors  via  the  addition  of  quadratic  terms  as  done  in  1980’s  AFOSR  supported  research  [15] 
for  the  single  variable  case. 

Embarking  on  this  research  fortuitously  led  to  the  most  significant  results  we  have  to  date  for 
the  noncovex  case  and  bode  well  for  proposed  research  under  our  new  AFOSR  grant.  So  far,  we 
have  discovered  how  to  define  a  non-VU  algorithm  with  finite  line  searches  and  overall  convergence 
to  local  minimizers  for  semismooth  functions  without  even  requiring  backtrack  steps.  The  only 
steps  needed  are  serious  and  null  ones  depending  on  newly  developed  Armijo  and  Wolfe  line  search 
tests  appropriately  generalized  for  nonsmoothness  and  nonconvexity.  Moreover,  the  Hessians  based 
on  negative  curvature  discovered  during  line  searches  need  not  be  stored  as  matrices,  since  they 
are  needed  only  for  matrix- vector  products.  Instead  pairs  of  point  and  corresponding  gradient 
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difference  vectors  can  be  stored  for  use  in  a  symmetric  rank  one  (SRI)  update  formula  starting 
from  the  zero  matrix. 

Future  research  will  investigate  if  the  assumption  of  /  being  lower  C ,  which  implies  semis¬ 
moothness,  may  be  sufficient  to  obtain  boundedness  of  the  negative  curvature  matrices  which  is 
a  condition  needed  for  convergence  proofs.  This  could  be  the  case,  because  such  a  function  has 
the  implicit  property  of  being  the  sum  of  a  convex  function  and  quadratic  function  [22].  Also, 
there  will  be  work  on  how  and  when  to  add  U-steps  to  the  serious  steps  of  the  basic  nonconvex 
algorithm  in  order  to  achieve  rapid  local  convergence  under  appropriate  assumptions. 
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